The world of data has experienced unprecedented growth.
Natural Language Processing (NLP)
NLP: the way computers read text and imitate human language.
We can apply NLP techniques with quanteda, which makes research easier (tokenization, stopwords, and parts of speech).
Quanteda is a package that gives you the power to process, wrangle, and analyze text in multiple ways.
It's easy to use and its applications are enormous.
Quantitative and qualitative analysis: the best of both worlds in one single package.
Text analysis: quanteda is one of the best ways to do it.

A lot of data is in text form: many tools convert audio into text, and webpages and social media produce huge amounts of text data. That makes text analysis especially valuable for the social sciences.

Three core objects:
📖 1.-Corpus: the original data that will be pre-processed and analyzed.
🛠 2.-Tokens: the result of tokenization, storing the words of our texts for further analysis.
📑 3.-Document Feature Matrix (DFM): helps us analyze and store the features of a text.
📚 Text files: Quanteda uses the readtext package. We can read .txt, .csv, .tab, and .json files.
We can even read .pdf, .doc, and .docx files.
Amazing, right? For our tutorial, we will use .txt files, as in the sketch below.
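A minimal sketch of how this looks with readtext (the folder path is a placeholder; any directory of .txt files works the same way):
library(readtext)
#Read every .txt file in a (hypothetical) folder into a data frame
#with one row per file: a doc_id column and a text column
df_texts <- readtext("texts/my-show/*.txt")
head(df_texts)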

Important things:
📚 We can create a corpus from:
Character vectors c()
🖼 Dataframes that contain one column with the text to be analyzed.
⛔ IMPORTANT ⛔: the string variable of your df must be named text.
SimpleCorpus objects from the tm package.
Both options are sketched below.
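A minimal sketch of both options, using made-up example texts:
library(quanteda)
#Option 1: from a character vector (names become document names)
corp_vec <- corpus(c(doc1 = "Have you met Ted?", doc2 = "Suit up!"))
#Option 2: from a data frame whose text column is named "text"
df_example <- data.frame(doc_id = c("ep1", "ep2"),
                         text = c("Have you met Ted?", "Suit up!"))
corp_df <- corpus(df_example)
summary(corp_df)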
Here you can see in our exercises what we obtain:
1.-Text: the name of our document. In our case, the names are the episode titles of HIMYM.
2.-Types: the number of unique features we can wrangle.
3.-Tokens: the number of tokens each document contains.
4.-Sentences: the number of sentences per document. In our case, TV scripts.
5.-Chapter and No.overall are variables that we added. We will explain them later.

Important things:
Tokenization segments texts into tokens (mainly words or sentences) at word boundaries.
📚 What a token object contains:
😎 Why is tokenization awesome?
You have options like remove_separators, remove_numbers, and remove_symbols.
Here you can see in our exercises what we obtain: the words are separated, as in the sketch below.
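A minimal sketch of tokenization with those options, on a made-up sentence:
library(quanteda)
toks <- tokens("Kids, I'm going to tell you an incredible story!",
               remove_punct = TRUE,   #Drop punctuation
               remove_numbers = TRUE, #Drop numbers
               remove_symbols = TRUE) #Drop symbols
toks #One token per word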


Important things:
DFM objects are super useful because we can compute statistics and run analyses on them.
📜 What a DFM object contains:
A matrix is a 2-dimensional array with m rows and n columns.
In a dfm, each row represents a document and each column represents a feature.
It enables us to identify the most frequent features of a document.
It analyzes text based on the "bag of words" model.
Here you can see in our exercises what we obtain: the features and their counts, as in the sketch below.
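A minimal sketch of the bag-of-words idea, with made-up documents:
library(quanteda)
toks <- tokens(c(d1 = "Ted met Robin", d2 = "Robin met Barney and Ted"))
dfm_example <- dfm(toks) #Rows = documents, columns = features
dfm_example
topfeatures(dfm_example) #Most frequent features across all documents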

Source: our amazing classmates from the MDS 2023: Laura Menicacci & Dinah Rabe
Main parts:
Remember that you can use a pipe %>% with all the functions of the package.
First step: corpus(your_dataframe, etc.) = creates a corpus object from available sources.
Second step: tokens(your_corpus_object) = constructs a tokens object.
Third step: dfm(your_token_object) = constructs a sparse document-feature matrix from a character vector, corpus, tokens object, or even another dfm.
The three steps are sketched below.
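A minimal sketch of the three steps chained with a pipe, on a made-up data frame:
library(quanteda)
df_example <- data.frame(doc_id = c("ep1", "ep2"),
                         text = c("Have you met Ted?", "Suit up, Marshall!"))
dfm_example <- corpus(df_example) %>% #First step: corpus
  tokens(remove_punct = TRUE) %>%     #Second step: tokens
  dfm()                               #Third step: document-feature matrix
dfm_example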
Remember that you can use a pipe %>% with all the functions of the package.
docnames(your_corpus) = renames your documents.
corpus_subset() = subsets a corpus to the documents that meet a certain condition, like a filter.
corpus_group(your_corpus, etc.) = combines documents in a corpus object by a grouping variable.
corpus_trim(your_corpus, etc.) = removes sentences shorter than a specified length from a corpus or character vector.
corpus_segment(your_corpus, etc.) = segments corpus texts or a character vector, splitting on a pattern match.
Some of these helpers are sketched below.
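A minimal sketch of these helpers on a tiny made-up corpus with a Season docvar:
library(quanteda)
corp <- corpus(c(ep1 = "Have you met Ted? It is going to be legendary.",
                 ep2 = "Suit up! Challenge accepted."),
               docvars = data.frame(Season = c(1, 2)))
corpus_subset(corp, Season == 1)    #Keep documents that meet a condition
corpus_group(corp, groups = Season) #Combine documents by a grouping variable
corpus_trim(corp, what = "sentences", min_ntoken = 4) #Drop very short sentences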
Remember that you can use a pipe %>% with all the functions of the package.
tokens() = constructs a tokens object.
tokens_select(your_token_obj) = selects or discards tokens from a tokens object.
tokens_remove(your_token_obj) = same as tokens_select, but removes words, phrases, etc.
tokens_keep(your_token_obj) = same as tokens_select, but keeps words, phrases, etc.
tokens_group(your_token_obj) = combines documents in a tokens object by a grouping variable.
tokens_tolower(your_token_obj) = converts the features of a tokens object to lowercase and re-indexes the types.
Some of these helpers are sketched below.
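A minimal sketch of these token helpers on a made-up sentence:
library(quanteda)
toks <- tokens("Barney says: Suit up! It is going to be LEGENDARY.",
               remove_punct = TRUE)
tokens_tolower(toks)                      #Lowercase and re-index the types
tokens_remove(toks, stopwords("english")) #Discard stopwords
tokens_keep(toks, c("suit", "legendary")) #Keep only selected tokens (case-insensitive by default)
tokens_select(toks, "legend*", selection = "keep") #Glob patterns work too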
Remember that you can use a pipe %>% with all the functions of the package.
dfm(your_token_obj) = constructs a sparse document-feature matrix.
dfm_lookup(your_dfm) = applies a dictionary to a dfm by looking up all dfm features for matches.
dfm_match(your_dfm) = matches the feature set of a dfm to a specified vector of feature names.
dfm_subset(your_dfm) = returns the document subsets of a dfm that meet a certain condition.
Some of these helpers are sketched below.
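A minimal sketch of these dfm helpers, with made-up documents and a hypothetical dictionary:
library(quanteda)
corp <- corpus(c(ep1 = "Ted loves Robin", ep2 = "Barney loves suits"),
               docvars = data.frame(Season = c(1, 2)))
dfm_ex <- corp %>% tokens() %>% dfm()
dfm_subset(dfm_ex, Season == 1) #Filter documents by a docvar
dict <- dictionary(list(people = c("ted", "robin", "barney")))
dfm_lookup(dfm_ex, dictionary = dict) #Count dictionary matches per document
dfm_match(dfm_ex, features = c("ted", "mother")) #Align to a fixed feature set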
This workshop aims to use the incredible quanteda package to analyze the television series “How I Met Your Mother” and demonstrate many of the package’s tools. We will explore the characters, identify adjectives, and render wordclouds, network plots, and even a sentiment analysis.
Plot: “Ted has fallen in love. It all started when his best friend, Marshall, drops the bombshell that he plans to propose to longtime girlfriend Lily, a kindergarten teacher. Suddenly, Ted realizes that he had better get a move on if he hopes to find true love. Helping him in the quest is Barney, a friend with endless – often outrageous – opinions, a penchant for suits and a foolproof way to meet women. When Ted meets Robin, he is sure it’s love at first sight, but the affair fizzles into friendship. Voice-over by Bob Saget (”Full House”) tells the story through flashbacks.”
Source: Rotten Tomatoes
Ted
Actor: Josh Radnor

Barney
Actor: Neil Patrick Harris

Robin
Actor: Cobie Smulders

Marshall
Actor: Jason Segel

Lily
Actor: Alyson Hannigan
“The story of five friends sitting in their favorite booth at MacLaren’s, their lives unfolding in front of each other, How I Met Your Mother is heartwarming and hilarious at the same time. Some believe that HIMYM is Ted’s story. Others think that it is Marshall and Lily’s story. And there’s a whole school of thought that it’s no one else but Barney’s story. We would like to think that it’s all of their stories because there won’t be a Ted without Barney or a Lily without Marshall, and definitely no Robin without a Ted (and Barney too). That’s how crucial each of the members of this group is, playing a major role in each other’s lives, helping them grow and become what they wanted to be.”
Source: Collider

These are the libraries we will use for our analysis. Each line notes the purpose of the package.
library(readtext) #Import and handle plain and formatted text files.
library(rvest) #Easily harvest (scrape) web pages.
library(xml2) #Work with XML files using a simple, consistent interface.
library(polite) #Be responsible when scraping data from websites.
library(httr) #Work with HTTP, organised by HTTP verbs.
library(tidyverse) #Opinionated collection of R packages designed for data science.
library(tidytext) #Functions and supporting datasets for text conversion.
library(quanteda) #OUR PACKAGE for text analysis.
library(quanteda.textstats) #OUR SUBPACKAGE for text statistics.
library(quanteda.textplots) #OUR SUBPACKAGE for text plots.
library(stringr) #Consistent wrappers for common string operations.
library(spacyr) #NLP wrapper around Python's spaCy that helps us classify words.
library(ggsci) #Collection of high-quality color palettes.
library(ggrepel) #Provides geoms for ggplot2 to repel overlapping text labels.
library(RColorBrewer) #Beautiful color palettes.
library(cowplot) #Put images in our plots.
library(magick) #Read and manipulate images in our environment.
library(gghighlight) #Highlight selected geoms in ggplot2 plots.
#Set image
obj_img <- image_read(path = "https://bit.ly/3twmH2Y")

We will web scrape our favorite TV show, “How I Met Your Mother,” to obtain the scripts of all 208 episodes. We will define the URLs, check whether we are allowed to scrape the site, and name the directory where we want to save our files.
v_tv_show <- "how-i-met-your-mother"
v_url_web <- "http://www.springfieldspringfield.co.uk/"
#Remember to be polite and check whether we can scrape the webpage
session_information <- bow(v_url_web) #Do a bow with the polite package
session_information
v_url <- paste(v_url_web,"episode_scripts.php?tv-show=", v_tv_show, sep="")
#Identify yourself
rvest_himym <- session(v_url,
add_headers(`From` = "jurjoo@gmail.com",
`UserAgent` = R.Version()$version.string))
#Start web scraping
html_url_scrape <- rvest_himym %>% read_html(v_url)
node_selector <- ".season-episode-title"
directory_path <- paste("texts/how-i-met-your-mother/", v_tv_show, sep = "")
### scrape href nodes in .season-episode-title-------------------------
html_url_all_seasons <- html_nodes(html_url_scrape, node_selector) %>%
html_attr("href")
### One loop for all our URL's----------------------------------------
for (x in html_url_all_seasons) {
read_ur <- read_html(paste(v_url_web, x, sep="/"))
Sys.sleep(runif(1, 0, 1)) #Be polite
# Element node that was checked and that contain the place of the scripts.
selector <- ".scrolling-script-container"
# Scrape the text
text_html <- html_nodes(read_ur, selector) %>%
html_text()
# Last five characters of html_url_all_seasons for saving this to separate text files (This is our pattern).
sub_data <- function(x, n) {
substr(x, nchar(x) - n + 1, nchar(x))
}
seasons_final <- sub_data(x, 5)
# Write each text file
write.csv(text_html, file = paste(directory_path, "_", seasons_final, ".txt", sep=""), row.names = FALSE)
}

It’s important to attach relevant information to our scripts: for example, episode title, episode number, season number, director, etc. That’s why we will scrape this information from the web.
url_himym <- "https://en.wikipedia.org/wiki/List_of_How_I_Met_Your_Mother_episodes"
rvest_himym_table <- session(url_himym,
add_headers(`From` = "jurjoo@gmail.com",
`UserAgent` = R.Version()$version.string))
l_tables_himym <- rvest_himym_table %>%
read_html() %>%
html_nodes("table") %>%
html_table(fill = TRUE)
#This generates a list with all the tables that contain the page. In our case,
#we want the table from the second element till the 10th.
l_tables_himym <- l_tables_himym[c(2:10)]
l_tables_himym[1]
## [[1]]
## # A tibble: 22 × 8
## No.overall `No. inseason` Title Direc…¹ Writt…² Origi…³ Prod.…⁴ US vi…⁵
## <int> <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 1 "\"Pilot\"" Pamela… Carter… Septem… 1ALH79 10.94[…
## 2 2 2 "\"Purple … Pamela… Carter… Septem… 1ALH01 10.40[…
## 3 3 3 "\"Sweet T… Pamela… Phil L… Octobe… 1ALH02 10.44[…
## 4 4 4 "\"Return … Pamela… Kourtn… Octobe… 1ALH03 9.84[1…
## 5 5 5 "\"Okay Aw… Pamela… Chris … Octobe… 1ALH04 10.14[…
## 6 6 6 "\"Slutty … Pamela… Brenda… Octobe… 1ALH05 10.89[…
## 7 7 7 "\"Matchma… Pamela… Chris … Novemb… 1ALH07 10.55[…
## 8 8 8 "\"The Due… Pamela… Gloria… Novemb… 1ALH06 10.35[…
## 9 9 9 "\"Belly F… Pamela… Phil L… Novemb… 1ALH09 10.29[…
## 10 10 10 "\"The Pin… Pamela… Carter… Novemb… 1ALH08 12.27[…
## # … with 12 more rows, and abbreviated variable names ¹`Directed by`,
## # ²`Written by`, ³`Original air date`, ⁴Prod.code, ⁵`US viewers(millions)`
Of course, we must clean our tables to have a final dataframe with the texts and the information of every episode.
#Reduce the list into one data frame since all the tables share the same structure
df_himym <- data.frame(Reduce(bind_rows, l_tables_himym))
#We do the same for the characters of HIMYM
url_himym_characters <- "https://en.wikipedia.org/wiki/List_of_How_I_Met_Your_Mother_characters"
rvest_himym_table_2 <- session(url_himym_characters,
add_headers(`From` = "jurjoo@gmail.com",
`UserAgent` = R.Version()$version.string))
l_tables_himym_characters <- rvest_himym_table_2 %>%
read_html() %>%
html_nodes("table") %>%
html_table(fill = TRUE)
df_characters <- as.data.frame(l_tables_himym_characters[[1]]) %>%
select(Character)
df_characters_w <- df_characters %>%
filter(!stringr::str_starts(Character, "Futu"),
!(Character %in% c("Character", "Main Characters",
"Supporting Characters"))) %>%
mutate(name = str_extract(Character,"([^ ]+)"),
name = replace(name, name == "Dr.", "Sonya"))
df_characters_w

Look how our code cleaned the TV show information; now we have it in a dataframe.
#We bind the tables with bind_rows
df_himym <- data.frame(Reduce(bind_rows, l_tables_himym))
df_himym_filt <- df_himym %>% filter(str_length(No.overall) < 4)
df_himym_filt_dupl <- df_himym %>% filter(str_length(No.overall) > 4)
#We are doing this particular wrangling to format in the best possible way our tables.
#Note that we are using stringr to manipulate our characters.
df_himym_filt_dupl_1 <- df_himym_filt_dupl %>%
mutate(No.overall = as.numeric(replace(No.overall, str_length(No.overall) > 4, substr(No.overall, 1, 3))),
No..inseason = as.numeric(replace(No..inseason, str_length(No..inseason) > 3, substr(No..inseason, 1, 2))),
Prod.code = replace (Prod.code, str_length(Prod.code) > 3, substr(Prod.code, 1, 6)))
df_himym_filt_dupl_2 <- df_himym_filt_dupl %>%
mutate(No.overall = as.numeric(replace(No.overall, str_length(No.overall) > 4, substr(No.overall, 4, 6))),
No..inseason = as.numeric(replace(No..inseason, str_length(No..inseason) > 3, substr(No..inseason, 3, 4))),
Title = replace(Title, Title == "\"The Magician's Code\"", "\"The Magician's Code Part 2\""),
Title = replace(Title, Title == "\"The Final Page\"", "\"The Final Page Part 2\""),
Title = replace(Title, Title == "\"Last Forever\"" , "\"Last Forever Part 2\"" ),
Prod.code = replace(Prod.code, str_length(Prod.code) > 3, substr(Prod.code, 7, 12)))
df_himym_final <- bind_rows(df_himym_filt,
df_himym_filt_dupl_1,
df_himym_filt_dupl_2) %>%
arrange(No.overall, No..inseason) %>%
mutate(year = str_extract(Original.air.date, '[0-9]{4}+'),
Season = as.numeric(stringr::str_extract(Prod.code, "^.{1}"))) %>%
rename(Chapter = No..inseason)
df_himym_final$US.viewers.millions. <- as.numeric(str_replace_all(df_himym_final$US.viewers.millions., "\\[[0-9]+\\]", ""))
df_himym_final

This is the final and most important step of the web scraping. Here, we merge our TV show scripts and the episode information into one single dataframe.
df_texts_himym <- readtext::readtext("texts/how-i-met-your-mother/*.txt")
v_season <- as.numeric(stringr::str_extract(df_texts_himym$doc_id, "\\d+"))
v_chapter <- as.numeric(stringi::stri_extract_last_regex(df_texts_himym$doc_id, "[0-9]+"))
df_texts_himym_w <- df_texts_himym %>% mutate(Season = v_season, Chapter = v_chapter)
df_himym_final_doc <- full_join(as.data.frame(df_texts_himym_w), df_himym_final, by = c("Season", "Chapter")) %>%
mutate(Season_w = paste("Season", Season),
Title_season = paste0(Title, " S", Season, " EP", Chapter))
df_himym_final_doc

Press the arrows in the top right corner of this interactive dataframe. As you can see, we have our final dataframe with the information about our TV show: season number, episode, etc.
Once we have our final dataframe, we can start our analysis using the quanteda package.
Look at our corpus: it’s divided into types, tokens, and even sentences.
corp_himym <- corpus(df_himym_final_doc) #Build a new corpus from the texts
docnames(corp_himym) <- df_himym_final_doc$Title #With docnames() we can change the name of our texts.
#In our case, we put in the titles of the episodes.
summary(corp_himym, n = 15)

Look at our tokenization: we separate our text into words. Amazing!
corp_himym_stat <- corp_himym
docnames(corp_himym_stat) <- df_himym_final_doc$Title_season
corp_himym_s1_simil <- corpus_subset(corp_himym_stat, Season == 1) #We want to analyze just the first season
toks_himym_s1 <- tokens(corp_himym_s1_simil, #corpus from all the episodes from the first season
remove_punct = TRUE, #Remove punctuation of our texts
remove_separators = TRUE, #Remove separators of our texts
remove_numbers = TRUE, #Remove numbers of our texts
remove_symbols = TRUE) %>% #Remove symbols of our texts
tokens_remove(stopwords("english")) #Remove stop words of our texts
toks_himym_s1
## Tokens consisting of 22 documents and 12 docvars.
## "Pilot" S1 EP1 :
## [1] "x" "OLDER" "TED" "Kids" "gonna"
## [6] "tell" "incredible" "story" "story" "met"
## [11] "mother" "punished"
## [ ... and 1,462 more ]
##
## "Purple Giraffe" S1 EP2 :
## [1] "x" "OLDER" "TED" "Okay" "telling"
## [6] "us" "met" "Mom" "excruciating" "detail"
## [11] "Right" "back"
## [ ... and 1,374 more ]
##
## "Sweet Taste of Liberty" S1 EP3 :
## [1] "x" "S" "Sy" "Syn" "Sync" "Sync" "b" "OLDER" "TED"
## [10] "one" "night" "met"
## [ ... and 1,350 more ]
##
## "Return of the Shirt" S1 EP4 :
## [1] "x" "OLDER" "TED"
## [4] "Kids" "single" "looking"
## [7] "happily-ever-after" "one" "stories"
## [10] "can" "end" "way"
## [ ... and 1,477 more ]
##
## "Okay Awesome" S1 EP5 :
## [1] "x" "OLDER" "TED" "kids" "like" "hear" "story" "time" "went"
## [10] "deaf" "even" "ask"
## [ ... and 1,138 more ]
##
## "Slutty Pumpkin" S1 EP6 :
## [1] "x" "OLDER" "TED" "know" "Aunt" "Robin's"
## [7] "big" "fan" "Halloween" "Always" "dressing" "crazy"
## [ ... and 1,405 more ]
##
## [ reached max_ndoc ... 16 more documents ]
Please take a look at our Document Feature Matrix. Look how it counts occurrences. We can do multiple things with it.
toks_himym_dm_s1 <- toks_himym_s1 %>%
dfm() #Convert our tokens into a document feature matrix
toks_himym_dm_s1
## Document-feature matrix of: 22 documents, 4,890 features (87.39% sparse) and 12 docvars.
## features
## docs x older ted kids gonna tell incredible story
## "Pilot" S1 EP1 1 1 22 3 22 6 1 7
## "Purple Giraffe" S1 EP2 1 5 30 2 19 1 1 1
## "Sweet Taste of Liberty" S1 EP3 1 3 27 1 15 6 0 2
## "Return of the Shirt" S1 EP4 1 5 14 1 15 4 0 6
## "Okay Awesome" S1 EP5 1 3 9 2 11 4 0 5
## "Slutty Pumpkin" S1 EP6 1 1 16 0 15 1 0 3
## features
## docs met mother
## "Pilot" S1 EP1 11 1
## "Purple Giraffe" S1 EP2 10 0
## "Sweet Taste of Liberty" S1 EP3 1 1
## "Return of the Shirt" S1 EP4 0 0
## "Okay Awesome" S1 EP5 0 0
## "Slutty Pumpkin" S1 EP6 4 0
## [ reached max_ndoc ... 16 more documents, reached max_nfeat ... 4,880 more features ]
The textstat_simil() function is super useful: with it, we will find the similarity between episodes of the first season.
tstat_simil <- textstat_simil(toks_himym_dm_s1) #Check similarity between episodes of the first season
clust <- hclust(as.dist(tstat_simil)) #Convert our object into a cluster (For visualization purposes)
dclust <- as.dendrogram(clust) #Convert our cluster into a dendrogram (For visualization purposes)
dclust <- reorder(dclust, 1:22) #Order our visualization
#Set colors
nodePar <- list(lab.cex = 1, pch = c(NA, 19),
cex.axis = 1.5,
cex = 2, col = "#0080ff")
par(mar = c(18.1, 6, 2, 3))
#Plot dendogram
plot(dclust, nodePar = nodePar,
las = 1,
cex.axis = 2, cex.main = 2, cex.sub = 2,
main = "How I Met Your Mother Season 1",
type = "triangle",
ylim = c(0,1),
ylab = "",
edgePar = list(col = 4:7, lwd = 7:7),
panel.first = abline(h = c(seq(.10, 1, .10)), col = "grey80"))
title(ylab = "Similarity between episodes (correlation %)", mgp = c(4, 1, 1), cex.lab = 2)
rect.hclust(clust, k = 5, border = "red")

Look how amazing the similarity is. The similarity is high for episodes like “Zip, Zip, Zip” and “Cupcake”, while the least similar episodes are “The Pineapple Incident” and “The Limo”.

The textstat_dist() function. Here, distance is the opposite of similarity: more distance equals less similar. Remember our similarity chart? Well, it’s the same idea, but here we obtain distances.
tstat_dist <- textstat_dist(toks_himym_dm_s1) #Check distance between episodes of the first season
clust <- hclust(as.dist(tstat_dist)) #Convert our object into a cluster (For visualization purposes)
dclust_dist <- as.dendrogram(clust) #Convert our cluster into a dendrogram (For visualization purposes)
dclust_dist <- reorder(dclust_dist, 1:22) #Order our visualization
par(mar = c(21, 6, 2, 3))
#Plot dendogram
plot(dclust_dist, nodePar = nodePar, #Reuse the node parameters from the similarity plot
las = 1,
cex.axis = 2, cex.main = 2, cex.sub = 2,
main = "How I Met Your Mother Season 1",
type = "triangle",
ylim = c(0, 120),
ylab = "",
edgePar = list(col = 11:19, lwd = 7:7),
panel.first = abline(h = c(seq(10, 120, 10)), col = "grey80"))
title(ylab = "Distance between episodes (correlation %)", mgp = c(4, 1, 1), cex.lab = 2)
rect.hclust(clust, k = 5, border = "red")

Remember that “The Pineapple Incident” and “The Limo” were the least similar episodes? Well, here these episodes are the ones with the most distance between them.

Now, we want to get to know the characters of the TV show. We will count the number of mentions of each principal character per season.
#Remember our second step: tokenize our corpus.
toks_himym <- tokens(corp_himym, #corpus from all the episodes
remove_punct = TRUE, #Remove punctuation of our texts
remove_separators = TRUE, #Remove separators of our texts
remove_numbers = TRUE, #Remove numbers of our texts
remove_symbols = TRUE) %>% #Remove symbols of our texts
tokens_remove(stopwords("english")) #Remove stop words from our texts
dfm_actors <- toks_himym %>%
tokens_select(c("Ted", "Marshall", "Lily", "Robin", "Barney", "Mother")) %>% #We just want to analyze these characters
tokens_group(groups = Season) %>% #We group our tokens (scripts) by season
dfm() #Transform the token into a DFM object
df_final_actors <- as.data.frame(textstat_frequency(dfm_actors, groups = c(1:9))) %>%
mutate(Season = paste("Season", group),
`Principal Characters` = replace(feature, is.character(feature), str_to_title(feature))) %>%
select(-feature)
df_final_actors

Here, we plot this frequency of characters. It’s no secret that Ted is the most mentioned character of the show, because he is the one telling the story to his kids. It’s interesting how Barney gets a “glow up” toward the end of the last season. Ironically, the show is called “How I Met Your Mother”, yet “Mother” is not mentioned that often across the seasons.
# Plot frequency of actors
ggplot1 <- ggplot(df_final_actors, aes(x = group, y = frequency, group = `Principal Characters`, color = `Principal Characters`)) +
geom_line(size = 1.5) +
scale_color_manual(values = brewer.pal(n = 6, name = "Dark2")) +
geom_point(size = 3.2) +
scale_y_continuous(breaks = seq(0, 560, by = 50), limits = c(0,560))+
theme_minimal(base_size = 14) +
labs(x = "Number of Season",
y = "Frequency of appearances",
title = "Appearances of principal characters by Season",
caption="Description: This plot shows the number of times that the \n principal characters appear in HIMYM per season.")+
theme(panel.grid.major=element_line(colour="#cfe7f3"),
panel.grid.minor=element_line(colour="#cfe7f3"),
plot.title = element_text(margin = margin(t = 10, r = 20, b = 30, l = 30)),
#axis.text.x=element_text(size=15),
#axis.text.y=element_text(size=15),
plot.caption=element_text(size=12, hjust=.1, color="#939393"),
legend.position="bottom",
plot.margin = margin(t = 20, # Top margin
r = 50, # Right margin
b = 40, # Bottom margin
l = 10), # Left margin
text=element_text()) +
#geom_segment(aes(x = 8.5, y = 75, xend = 8.8, yend = 70),
# arrow = arrow(length = unit(0.1, "cm")))+
guides(colour = guide_legend(ncol = 6))
ggdraw(ggplot1) + draw_image(obj_img, x = .97, y = .97,
hjust = 1.1, vjust = .7,
width = 0.11, height = 0.1)
Wordcloud plots are super useful for visualizing which words appear in a text and how often they are repeated. Knowing this, we want to do some analysis using wordclouds.
#Remember our second step: tokenize our corpus.
toks_himym_characters <- tokens(corp_himym, #corpus from all the episodes from all season
remove_punct = TRUE, #Remove punctuation of our texts
remove_separators = TRUE, #Remove separators of our texts
remove_numbers = TRUE, #Remove numbers of our texts
remove_symbols = TRUE) %>% #Remove symbols of our texts
tokens_keep(c(unique(df_characters_w$name))) #This function allows us to keep just the tokens that we want.
#In this case, we just want the characters.
toks_himym_characters
## Tokens consisting of 208 documents and 12 docvars.
## "Pilot" :
## [1] "TED" "Marshall" "Marshall" "Ted" "Lily" "Lily"
## [7] "Marshall" "Marshall" "Barney" "Marshall" "Lily" "Marshall"
## [ ... and 58 more ]
##
## "Purple Giraffe" :
## [1] "TED" "Robin" "Barney" "Ted" "Ted" "Ted" "Robin" "Robin"
## [9] "Lily" "Ted" "Ted" "Lily"
## [ ... and 61 more ]
##
## "Sweet Taste of Liberty" :
## [1] "TED" "Marshall" "Lily" "Barney" "Robin" "Ted"
## [7] "TED" "Barney" "Ted" "Marshall" "Lily" "Marshall"
## [ ... and 54 more ]
##
## "Return of the Shirt" :
## [1] "TED" "TED" "Barney" "Robin" "Robin" "Ted"
## [7] "Ted" "Lily" "Lily" "Marshall" "Marshall" "Lily"
## [ ... and 18 more ]
##
## "Okay Awesome" :
## [1] "TED" "Robin" "Marshall" "Lily" "TED" "Lily"
## [7] "Lily" "Marshall" "Ted" "Ted" "Marshall" "Marshall"
## [ ... and 27 more ]
##
## "Slutty Pumpkin" :
## [1] "TED" "Lily" "Robin" "Robin" "Ted" "Ted" "Ted" "TED" "TED"
## [10] "Ted" "LILY" "Ted"
## [ ... and 24 more ]
##
## [ reached max_ndoc ... 202 more documents ]
#Remember our third step: DFM object
dfm_general_characters <- toks_himym_characters %>%
dfm()
dfm_general_characters
## Document-feature matrix of: 208 documents, 65 features (88.51% sparse) and 12 docvars.
## features
## docs ted marshall lily barney carl robin ranjit don
## "Pilot" 22 18 14 6 5 3 2 0
## "Purple Giraffe" 30 5 6 4 0 27 0 1
## "Sweet Taste of Liberty" 27 15 7 10 0 7 0 0
## "Return of the Shirt" 14 3 3 2 0 7 0 0
## "Okay Awesome" 9 13 6 7 0 4 0 0
## "Slutty Pumpkin" 16 5 5 3 0 6 0 0
## features
## docs mickey gary
## "Pilot" 0 0
## "Purple Giraffe" 0 0
## "Sweet Taste of Liberty" 0 0
## "Return of the Shirt" 1 0
## "Okay Awesome" 0 0
## "Slutty Pumpkin" 0 1
## [ reached max_ndoc ... 202 more documents, reached max_nfeat ... 55 more features ]
textplot_wordcloud(dfm_general_characters,
rotation = 0.25,
min_size = 1.4, max_size = 8,
min_count = 1, #Minimum frequency
color = brewer.pal(11, "RdBu"))
#RColorBrewer::display.brewer.all()

And here we have it: our first wordcloud plot. Looks amazing! Remember that you can change its colors, sizes, and other settings.

Now, let’s do the same, but just with our secondary characters.
#Remember our second step: tokenize our corpus.
toks_himym_sec_characters <- tokens(corp_himym, #corpus from all the episodes from all season
remove_punct = TRUE, #Remove punctuation of our texts
remove_separators = TRUE, #Remove separators of our texts
remove_numbers = TRUE, #Remove numbers of our texts
remove_symbols = TRUE) %>% #Remove symbols of our texts
tokens_keep(c(unique(df_characters_w$name))) %>% #We want to keep all the characters
tokens_remove(c("Ted", "Barney", "Lily", "Robin", "Marshall")) #But we remove the principal characters#Remember our third step: DFM object
dfm_general_sec_characters <- toks_himym_sec_characters %>%
dfm()textplot_wordcloud(dfm_general_sec_characters,
random_order = FALSE,
rotation = 0.25,
min_size = 1, max_size =5,
labelsize = 1.5,
min_count = 1, #Minimum frequency
color = RColorBrewer::brewer.pal(8, "Spectral"))

For example, here we specified another palette. The plotted names are only secondary characters, because we removed the principal ones. Zoey, Stella, and Victoria are Ted’s ex-girlfriends.

spacyr provides a convenient R wrapper around the Python spaCy package, offering easy access to spaCy’s functionality. This package is amazing because spacyr automatically classifies our words into nouns, adjectives, verbs, dates, and much more. Of course, it is not 100% accurate, but it is an amazing tool for analysis!
#Be patient: the installation takes around 5-10 minutes. Please follow the on-screen steps when installing the packages.
library(spacyr)
spacy_install()
spacy_initialize(model = "en_core_web_sm")
sp_parse_doc <- spacy_parse(df_himym_final_doc, tag = TRUE)

This is our output. Look how the package separates our words automatically and classifies them as nouns, verbs, names, etc. It’s amazing! Of course, the classification is not 100% accurate, but it gives us a good idea of different aspects of our texts.
sp_parse_doc
sp_parse_var <- full_join(sp_parse_doc, df_himym_final_doc, by = c("doc_id"))
#In this case, we will just look at the proper names and adjectives.
sp_parse_var_PROPN <- sp_parse_var %>% filter(pos=="PROPN" & stringr::str_starts(entity, "PERSON_B"))
sp_parse_var_ADJ <- sp_parse_var %>% filter(pos == "ADJ")

We will build a wordcloud using the spacyr output. We will divide the output into adjectives and other features. Please check out the package, you will not regret it.
#Remember our second step: tokenize our corpus.
toks_himym_ADJ <- tokens(corp_himym, #corpus from all the episodes from all season
remove_punct = TRUE, #Remove punctuation of our texts
remove_separators = TRUE, #Remove separators of our texts
remove_numbers = TRUE, #Remove numbers of our texts
remove_symbols = TRUE) %>% #Remove symbols of our texts
tokens_keep(c(unique(sp_parse_var_ADJ$lemma))) %>% #We want to keep all the adjective
tokens_remove(c(stopwords("english"), "oh", "yeah", "okay", "like",
"get", "got", "can", "one", "hey", "go",
"Ted", "Marshall", "Lily", "Robin", "Barney", "just",
"know", "well", "right", "even", "see",
"sure", "back", "first", "said", "maybe", "wedding",
"whole", "wait")) #But we remove stopwords and other words that the package didn't classify it correctly. #Remember our third step: DFM object
df_general_ADJ <- toks_himym_ADJ %>%
tokens_group(groups = Season_w) %>% #group by season
dfm() %>% dfm_subset(Season < 9)

Look how the adjectives are distributed across the first 8 seasons. Unfortunately, the function only allows us 8 groups. Each color shows the adjectives of a different season.
textplot_wordcloud(df_general_ADJ,
random_order = FALSE,
rotation = 0.25,
comparison = TRUE,
labelsize = 1.5,
min_count = 1, #Minimum frequency
color = ggsci::pal_lancet(palette = "lanonc"))
We will compute adjective frequencies using the spacyr output. Again, we can do amazing things in terms of analysis.
#Remember our second step: tokenize our corpus.
freq_gen_dfm <- toks_himym_ADJ %>%
dfm()
#Generate dataframe
df_freq_gen_dfm <- as.data.frame(textstat_frequency(freq_gen_dfm, # Our DFM object
n = 10, #Number of observations displayed
groups = Season)) #Grouped by season
df_freq_gen_dfm_match <- df_freq_gen_dfm %>% mutate(total = 1) %>%
group_by(feature) %>%
summarise(total = sum(total)) %>%
filter(total== 9)
df_freq_gen_dfm_final <- right_join(df_freq_gen_dfm, df_freq_gen_dfm_match,
by = "feature") %>% rename(Word = feature) %>%
mutate(Word = str_to_title(Word))

Look at the frequency of adjectives. It’s incredible how often the word “sorry” appears; it tends to be the one our beloved characters keep using. Maybe they did awful things? You need to watch the TV show and decide for yourself.
ggplot2 <- ggplot(df_freq_gen_dfm_final, aes(x = group, y = frequency, group = Word, color = Word)) +
geom_line(size = 1.5, show.legend = TRUE) +
scale_color_manual(values = rev(brewer.pal(n = 7, name = "Dark2"))) +
geom_point(size = 3.2) +
theme_minimal(base_size = 14) +
labs(x = "Number of Season",
y = "Frequencies of words",
title = "Frequency of adjectives",
caption="Description: This plot shows the top adjectives that appears in every season of HIMYM")+
theme(panel.grid.major=element_line(colour="#cfe7f3"),
panel.grid.minor=element_line(colour="#cfe7f3"),
plot.title = element_text(margin = margin(t = 10, r = 20, b = 30, l = 30)),
#axis.text.x=element_text(size=15),
#axis.text.y=element_text(size=15),
plot.caption=element_text(size=12, hjust=.1, color="#939393"),
legend.position="bottom",
plot.margin = margin(t = 20, # Top margin
r = 50, # Right margin
b = 40, # Bottom margin
l = 10), # Left margin
text=element_text()) +
#geom_segment(aes(x = 8.5, y = 75, xend = 8.8, yend = 70),
# arrow = arrow(length = unit(0.1, "cm")))+
guides(colour = guide_legend(ncol = 4)) +
gghighlight(max(frequency) > 140,
keep_scales = TRUE,
unhighlighted_params = list(colour = NULL, alpha = 0.2))
ggdraw(ggplot2) + draw_image(obj_img, x = .97, y = .97,
hjust = 1.1, vjust = .7,
width = 0.11, height = 0.1)
How are the characters related to each other? We will find out with the amazing network plot function.
#Remember our second step: tokenize our corpus.
token_characters_himym <- tokens(corp_himym, #corpus from all the episodes from all season
remove_punct = TRUE, #Remove punctuation of our texts
remove_separators = TRUE, #Remove separators of our texts
remove_numbers = TRUE, #Remove numbers of our texts
remove_symbols = TRUE) %>% #Remove symbols of our texts
tokens_keep(c(unique(df_characters_w$name))) %>% #We want to keep all the characters
#Remember the characters that we web scraped before? Here we are using that vector to filter characters!
tokens_tolower() #We want lowercase tokens
#Extra step: create a feature co-occurrence matrix (FCM)
fcm_characters_himym <- token_characters_himym %>%
fcm(context = "window", window = 5, tri = FALSE)
#Vector with all the characters
v_top_characters <- stringr::str_to_sentence(names(topfeatures(fcm_characters_himym, 70)))
set.seed(100)
textplot_network(fcm_select(fcm_characters_himym, v_top_characters),
edge_color = "#008eed",
edge_size = 2,
vertex_labelcolor = "#006fba",
omit_isolated = TRUE,
min_freq = .1)
As expected, the network centers on the principal characters, but we can also see how the characters are related to each other. Here, we include every single person who appears on the show. In visualization terms, it can be a mess because we see a lot of lines. Let’s filter this FCM object again to analyze the 30 principal characters by frequency.
As we said before, if we want to be more specific, we can reduce our network plot to 30 characters. We follow the same steps as before but keep only the top features of the first 30 characters.
#Vector with 30 characters
v_top_characters_2 <- stringr::str_to_sentence(names(topfeatures(fcm_characters_himym, 30)))
textplot_network(fcm_select(fcm_characters_himym, v_top_characters_2),
edge_color = "#008eed",
edge_size = 5,
vertex_labelcolor = "#006fba",
omit_isolated = TRUE,
min_freq = .1)
If we want to be even more specific, we can reduce our network plot and weight it around just one character: in this case, Ted. We are “weighting” this network plot because we want to see how often each character is related to Ted. We can do the same with the previous plots to check how the network changes.
fcm_characters_himym_ted <- token_characters_himym %>%
tokens_remove(c("marshall", "lily", "barney", "robin")) %>% #Here we just want ted, that why we remove the other principal characters
fcm(context = "window", window = 5, tri = FALSE)
#Vector with 30 characters
v_top_characters_3 <- stringr::str_to_sentence(names(topfeatures(fcm_characters_himym_ted, 30)))
#Create a FCM matrix with our characters
vertex_size_f <- fcm_select(fcm_characters_himym_ted, pattern = v_top_characters_3)
#Create a proportion
v_proportion <- rowSums(vertex_size_f)/min(rowSums(vertex_size_f))
#Vector of Ted
x_p <- c("ted")
#Replace that proportion in our vector
final_v <- replace(v_proportion, names(v_proportion) %in% x_p,
v_proportion[names(v_proportion) %in% x_p]/15)
textplot_network(fcm_select(fcm_characters_himym_ted, v_top_characters_3),
edge_color = "#008eed",
edge_size = 5,
vertex_labelcolor = "#006fba",
omit_isolated = TRUE,
vertex_labelsize = final_v,
min_freq = .1)
Identify and score multi-word expressions, or adjacent fixed-length collocations, from text using textstat_collocations().
We want to see which phrases are the most used in the first season. This is a simple step toward understanding how important some compound phrases can be.
#Remember our second step: tokenize our corpus.
toks_himym_s1 <- tokens(corp_himym_s1_simil, #Define our corpus for the first season
padding = TRUE) %>% #Leave an empty string where the removed tokens previously existed
tokens_remove(stopwords("english")) #Remove stopwords of our tokenhimym_s1_collocations <-textstat_collocations(toks_himym_s1, #Our token object
tolower = F) #Keep capital letters
df_himym_s1_coll <- data.frame(himym_s1_collocations) %>%
rename(`Total of collocations` = count)

Good! Look at the collocations: “get married” and “party number” are the most used in the first season. The lambda and z statistics are the metrics used to plot the different collocations. Every dot in this graph represents one compound phrase, and the size of each dot shows how many times the characters said that phrase.
ggplot3 <- ggplot(df_himym_s1_coll, aes(x = z, y = lambda, label = collocation)) +
geom_point(alpha = 0.2, aes(size = `Total of collocations`), color = "#00578a")+
geom_point(data = df_himym_s1_coll %>% filter(z > 15),
aes(x = z, y = lambda, size = `Total of collocations`),
color = '#00578a') +
geom_text_repel(data = df_himym_s1_coll %>% filter(z > 15), #Function from ggrepel package. Show scatterplots with text.
aes(label = collocation, size = count), size = 3,
box.padding = unit(0.35, "lines"),
point.padding = unit(0.3, "lines")) +
scale_y_continuous(breaks = seq(0, 16, by = 1), limits = c(0,16))+
theme_minimal(base_size = 14) +
labs(x = "Z statistic",
y = "Lambda",
title = "Allocations of words in the First Season",
caption = "Description: This plot identifies and scores multi-word expressions of the 1st season")+
theme(panel.grid.major = element_line(colour = "#cfe7f3"),
panel.grid.minor = element_line(colour = "#cfe7f3"),
plot.title = element_text(margin = margin(t = 10, r = 20, b = 30, l = 30)),
#axis.text.x=element_text(size=15),
#axis.text.y=element_text(size=15),
plot.caption = element_text(size=12, hjust=.1, color="#939393"),
legend.position="bottom",
plot.margin = margin(t = 20, # Top margin
r = 50, # Right margin
b = 10, # Bottom margin
l = 10))
ggdraw(ggplot3) + draw_image(obj_img, x = .97, y = .97,
hjust = 1.1, vjust = .7,
width = 0.11, height = 0.1)
What about looking at the most iconic phrases in HIMYM? It’s going to be…wait for it…legendary. We can do that by locating keywords in context with kwic().
#Set dataframe to merge with other information--------------------------
df_title_s_chp <- df_himym_final_doc %>%
select(Title, Season, Chapter, No.overall,
Season_w, US.viewers.millions.)
#First step: Define a corpus --------------------------------------
corp_himym <- corpus(df_himym_final_doc) # build a new corpus from the texts
docnames(corp_himym) <- df_himym_final_doc$Title #Rename docnames with Title of the episode
corp_himym_s5 <- corpus_subset(corp_himym, #our corpus
Season == 5) #Filter by season

An example with the word “love”: because this TV show is about love, let’s find this word in the context of different episodes. We will use the fifth season to locate the word “love” in context.
toks_himym_s5 <- tokens(corp_himym_s5, #Corpus of season 5
padding = TRUE)
kw_himym_s5_love <- kwic(toks_himym_s5, #token object
pattern = "love*", #pattern that we want to look for
window = 10) #how many words you want before and after your pattern
df_kw_himym_s5_love <- as.data.frame(kw_himym_s5_love) %>%
rename(Title = docname,`Pre Sentence` = pre, `Post Sentence` = post)%>%
rename_with(str_to_title, .cols = everything()) %>% left_join(df_title_s_chp,
by ="Title") %>%
relocate(Title, Season, Chapter)
df_kw_himym_s5_love

That’s amazing: it seems that the word “love” appears 150 times in the fifth season alone.
Now, let’s do another example with the word: legendary. We will search this word but for all seasons.
toks_himym <- tokens(corp_himym, #Define our corpus for all seasons
padding = TRUE) #Leave an empty string where the removed tokens previously existed
kw_himym_legendary <- kwic(toks_himym, #token object
pattern = "legendary*", #pattern that we want to look for
window = 10) #how many words you want before and after your pattern
df_kw_himym_legendary <- as.data.frame(kw_himym_legendary) %>%
rename(Title = docname,`Pre Sentence` = pre, `Post Sentence` = post)%>%
rename_with(str_to_title, .cols = everything()) %>% left_join(df_title_s_chp,
by = "Title") %>%
relocate(Title, Season, Chapter)
df_kw_himym_legendary

Hmm, we thought that the word “legendary” was going to appear more often. Maybe they didn’t mention it that much.
We can even search for phrases like “wait for it”. But don’t worry, you don’t need to wait for us. We are here.
kw_himym_wait_for <- kwic(toks_himym, #token object.
pattern = phrase("wait for it"), #Here we can specify even a phrase
window = 10) #how many words you want before and after your pattern
df_kw_himym_wait_for <- as.data.frame(kw_himym_wait_for) %>%
rename(Title = docname,`Pre Sentence` = pre, `Post Sentence` = post)%>%
rename_with(str_to_title, .cols = everything()) %>% left_join(df_title_s_chp,
by = "Title") %>%
relocate(Title, Season, Chapter)
df_kw_himym_wait_for

As you can see from the number of rows, the phrase “wait for it” appears 80 times throughout the different episodes.
EXTRA: just because we were having a lot of fun with this package, we are going to do a quick sentiment analysis.
toks_himym <- tokens(corp_himym, #Our corpus object
remove_punct = TRUE, #Remove punctuation in our texts
remove_separators = TRUE, #Remove separators in our texts
remove_numbers = TRUE, #Remove numbers in our texts
remove_symbols = TRUE) %>% #Remove symbols in our texts
tokens_remove(stopwords("english"))#Add additional words
#tidy_sou <- df_himym_final_doc %>%
# unnest_tokens(word, text) This is another way on spacyrWe will use the get_sentiments functions from the
tidytext package to get positive and negative words. We
have four sources. We are going to use bing, but you can choose the one
that you like the most. What we are obtaning is just string vectors with
negative and positive words.
df_positive_words <- get_sentiments("bing") %>% #We have four options: "bing", "afinn", "loughran", "nrc"
filter(sentiment == "positive")
df_negative_words <- get_sentiments("bing") %>%
filter(sentiment == "negative")We must define a dictionary to put it into a dictionary and pass it thorugh a dfm object. We know that you are an expert on that now.
#Define a dictionary with positive and negative words from bing --------------------------------------
l_sentiment_dictionary <- dictionary(list(positive = df_positive_words,
negative = df_negative_words))💡 Warning: this functions takes 30 minutes: be patience. Don’t worry, we will charge the dataframe for you.
dfm_sentiment_himym <- dfm(toks_himym) %>% dfm_lookup(dictionary = sentiment_dictionary)We will charge the document for you from the repo that you download it. We got you.
##Load a file
#It is a DFM object, which comes from a token off all the season of HIMYM
load(file = "data/dfm_sentiment_himym.Rdata")
#Rename doc:id with the Titles of every episode
docnames(dfm_sentiment_himym) <- df_himym_final_doc$Title

We will format our dataframe.
#Format in long to plot positive and negative words
df_sentiment_himym <- convert(dfm_sentiment_himym, "data.frame") %>%
gather(positive.word, negative.word, key = "Polarity", value = "Words") %>%
rename(Title = doc_id) %>%
mutate(Title = as_factor(Title)) %>%
left_join(df_title_s_chp, by ="Title") %>%
mutate(Polarity = replace(Polarity, is.character(Polarity),
str_replace_all(Polarity,
pattern = "negative.word",
replacement = "Negative words")),
Polarity = replace(Polarity, is.character(Polarity),
str_replace_all(Polarity,
pattern = "positive.word",
replacement = "Positive words")))
ggplot4 <- ggplot(df_sentiment_himym, aes(x = Chapter, y = Words, fill = Polarity, group = Polarity)) +
geom_bar(stat = 'identity', position = position_dodge(), size = 1) +
facet_wrap(~ Season_w)+
scale_fill_manual(values = c("#c6006f", "#004383")) +
scale_y_continuous(breaks = seq(0, 250, by = 50))+
theme_minimal(base_size = 14) +
labs(x = "Episodes",
y = "Frequency of words",
title = "Total of positive and negative words per season",
caption="Description: This plot identifies total of positive and negative words \n per season and episode")+
theme(panel.grid.major = element_line(colour="#cfe7f3"),
panel.grid.minor = element_line(colour="#cfe7f3"),
plot.title = element_text(margin = margin(t = 10, r = 20, b = 30, l = 30)),
#axis.text.x=element_text(size=15),
#axis.text.y=element_text(size=15),
plot.caption = element_text(size = 12, hjust = .1, color = "#939393"),
legend.position = "bottom",
plot.margin = margin(t = 20, # Top margin
r = 50, # Right margin
b = 10, # Bottom margin
l = 10))
ggdraw(ggplot4) + draw_image(obj_img, x = .97, y = .97,
hjust = 1.1, vjust = .7,
width = 0.11, height = 0.1)
Look at the total (raw) counts of positive and negative words per season. This is a good metric, but it doesn’t tell us that much. At least we know how many positive/negative words are used in the episodes.
dfm_weight(): we can be fairer. Let’s now calculate the weight of the words.
This step is the same as the last one, but here we are taking into account the weights to do a fair comparison.
dfm_sentiment_himym_prop <- dfm_weight(dfm_sentiment_himym, scheme = "prop")
dfm_sentiment_himym_prop
## Document-feature matrix of: 208 documents, 4 features (47.72% sparse) and 11 docvars.
## features
## docs positive.word positive.sentiment negative.word
## "Pilot" 0.6621005 0 0.3378995
## "Purple Giraffe" 0.6722222 0 0.3277778
## "Sweet Taste of Liberty" 0.6510417 0 0.3489583
## "Return of the Shirt" 0.5977011 0 0.4022989
## "Okay Awesome" 0.6257310 0 0.3742690
## "Slutty Pumpkin" 0.6267281 0 0.3732719
## features
## docs negative.sentiment
## "Pilot" 0
## "Purple Giraffe" 0
## "Sweet Taste of Liberty" 0
## "Return of the Shirt" 0
## "Okay Awesome" 0
## "Slutty Pumpkin" 0
## [ reached max_ndoc ... 202 more documents ]
We repeat the same process. It seems that HIMYM is positive after all. Amazing.
df_sentiment_himym_prop <- convert(dfm_sentiment_himym_prop, "data.frame") %>%
gather(positive.word, negative.word, key = "Polarity", value = "Words") %>%
rename(Title = doc_id) %>%
mutate(Title = as_factor(Title)) %>%
left_join(df_title_s_chp, by = "Title") %>%
mutate(Polarity = replace(Polarity, is.character(Polarity),
str_replace_all(Polarity,
pattern = "negative.word",
replacement = "Negative words")),
Polarity = replace(Polarity, is.character(Polarity),
str_replace_all(Polarity,
pattern = "positive.word",
replacement = "Positive words")))
### 14.07.02.- Plot total of positive and negative words per season and episode -----
#This step is the same as the last one, but here we are taking into account the weights to do a fair comparison
ggplot5 <- ggplot(df_sentiment_himym_prop, aes(x = Chapter, y = Words, fill = Polarity, group = Polarity)) +
geom_bar(stat = 'identity', position = position_dodge(), size = 1) +
facet_wrap(~ Season_w) +
scale_fill_manual(values = c("#c6006f", "#004383")) +
scale_y_continuous(breaks = seq(0, .8, by = .2))+
theme_minimal(base_size = 14) +
labs(x = "Episodes",
y = "Frequency of words",
title = "Weighted positve and negative words per season",
caption = "Description: This plot identifies the weighted total of positive and negative words \n per season and episode")+
theme(panel.grid.major = element_line(colour = "#cfe7f3"),
panel.grid.minor = element_line(colour = "#cfe7f3"),
plot.title = element_text(margin = margin(t = 10, r = 20, b = 30, l = 30)),
#axis.text.x=element_text(size=15),
#axis.text.y=element_text(size=15),
plot.caption = element_text(size = 12, hjust = .1, color = "#939393"),
legend.position = "bottom",
plot.margin = margin(t = 20, # Top margin
r = 50, # Right margin
b = 10, # Bottom margin
l = 10))
ggdraw(ggplot5) + draw_image(obj_img, x = .97, y = .97,
hjust = 1.1, vjust = .7,
width = 0.11, height = 0.1)
Now let’s compute a rate to check which episodes have a more negative context. We will use “Scaling Policy Preferences from Coded Political Texts” by Will Lowe, Kenneth Benoit, Slava Mikhaylov, and Michael Laver (2010).
They use a balance between positive and negative words on a log scale, which you can see in the code.
Here is their formula for the proportion:
\[ \theta^{(L)} = \log\frac{R + 0.5}{N + 0.5} \] where R = positive words and N = negative words.
#Here we applied the formula proposed before.
df_sentiment_himym_prop_measure <- convert(dfm_sentiment_himym_prop, "data.frame") %>%
rename(Sentiment = positive.word) %>% rename(Title = doc_id) %>%
left_join(df_title_s_chp, by = "Title") %>%
mutate(measure = log((Sentiment + 0.5)/(negative.word + .5))) %>%
select(-Season) %>%
rename(Season = Season_w)
Plot the measure of positivity across seasons.
Woooow! We confirm that it is a positive show, but it’s interesting how certain episodes, mostly from the last season, have a negative context. This totally makes sense, because by that time Lily was fighting with Marshall about their baby, and Robin, Ted, and Barney were having problems (love triangle).
ggplot6 <- ggplot(df_sentiment_himym_prop_measure, aes(x = No.overall, y = measure,
color = Season, group = Season)) +
scale_color_manual(values = brewer.pal(n = 9, name = "Set1"))+
geom_line(size = 1.5) +
geom_point(size = 3.2) +
scale_x_continuous(breaks = seq(0, 208, by = 20))+
theme_minimal(base_size = 14) +
labs(x = "Number of episode",
y = "Rate",
title = "Measure of positivity among episodes",
caption="Description: This plot shows the positivity rate of every episode")+
theme(panel.grid.major = element_line(colour = "#cfe7f3"),
panel.grid.minor = element_line(colour = "#cfe7f3"),
plot.title = element_text(margin = margin(t = 10, r = 20, b = 30, l = 30)),
plot.caption = element_text(size=12, hjust = .1, color = "#939393"),
legend.position = "bottom",
plot.margin = margin(t = 20, # Top margin
r = 50, # Right margin
b = 40, # Bottom margin
l = 10), # Left margin
text = element_text()) +
guides(colour = guide_legend(ncol = 3)) +
geom_hline(yintercept = 0, linetype = "dashed",
color = "red", size = 1)
ggdraw(ggplot6) + draw_image(obj_img, x = .97, y = .97,
hjust = 1.1, vjust = .7,
width = 0.11, height = 0.1)
Thanks for your attention; we hope you find this material useful. If you have any questions, please reach out to us. You have all of our contact information in the repo.
Have a good day!
A Beginner’s Guide to Text Analysis with quanteda (University of Virginia)
Amazing document created by Kenneth Benoit (University of Münster)
An Introduction to Text as Data with quanteda (Penn State and Essex courses)
Text as Data: quantitative text analysis with R. Data Science Summer School 2022. Hertie School
Text as data: Kenneth Benoit, Director, LSE Data Science Institute
All rights of each image to whom they correspond.

A work by Jorge Roa, Augusto Fonseca & Alexander Kraess
Prepared for Intro to Data Science Workshop 2022